This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Epidemic curves

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Packages

This code chunk shows the loading of packages required for the analyses.

# Create vector of names of required packages:
packages_epicurve <- c("rio",          # File import
                       "here",         # File locator
                       "tidyverse",    # data manipulation
                       "ggplot2",      # Produce plots and graphs
                       "aweek",        # working with dates
                       "lubridate",    # Manipulate dates    
                       "incidence",    # an option for epicurves of linelist data
                       "stringr",      # Search and manipulate character strings
                       "forcats",      # working with factors
                       "RColorBrewer", # Color palettes from colorbrewer2.org
                       "DT"            # produce tables for this html handbook
                       )               ### close vector of packages

# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(packages_epicurve, character.only=TRUE)

Load data

Two example datasets are used in this section:

  • Linelist of individual cases from a simulated Ebola epidemic
  • Aggregated counts by hospital from the same simulated Ebola epidemic

If viewing in Google Chrome, you can access these datasets in Microsoft Excel by clicking HERE and HERE. TODO.

The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.

For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.

# import the linelist into R
linelist <- rio::import("ebola_simulated.xlsx")

Do cleaning steps as necessary!!! To Do

Review data

Review the two datasets and notice the differences

Linelist dataset

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Aggregated counts dataset

# display the linelist data as a table
DT::datatable(count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Set parameters

You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). In this case, we set this date as 27 July 2013.

Now we can reference the object data_date into the code and have it reference that date.

## set the report date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2013-07-27")

Clean dates

Optionally, it can be nice to identify all the date variables and store their names in a vector. This can be done by individually naming them, or by searching for them by looking for keywords.

Method 1

# Method 1. Define date variables explicitly in a vector
DateVars <- c("date_onset",
              "date_hospitalisation",
              "date_outcome",
              "date_infection"
              )
DateVars
## [1] "date_onset"           "date_hospitalisation" "date_outcome"        
## [4] "date_infection"

Method 2

# Method 2: Search for date columns
DateVars <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))
DateVars
## [1] "date_infection"       "date_onset"           "date_hospitalisation"
## [4] "date_outcome"

#Note: other search tool options within vars_select include contains() ends_with(), starts_with(), or one_of()

Ensure date variables are date class

Verify

Verify that each variable was successfully converted to date class by printing statistics and a quick histogram for each one.

# To verify successful conversion of date variables

# Creates list of column numbers of date variables 
varNums <- c()
for (varName in DateVars) {
  varNum <- match(varName, names(linelist))
  varNums <- c(varNums, varNum)
}

# Produce output for each date variable converted
for (varNum in varNums) {
  varName <- names(linelist)[varNum]         # get name of variable
  class <- class(linelist[, varNum])                  # get class
  missing <- sum(is.na(linelist[, varNum]))  # get number missing values
  hist(linelist[, varNum],                   # histogram
       breaks = 50,
       main = paste0("Histogram of: ", varName, ", Class: ", class, ", Missing: ", missing),
       xlab = varName)  
  }

incidence package

Below are tabs on using the “incidence” package

Outline

This section shows variations on the epicurve using the incidence package

#Determine the number of observations that are missing a date of onset

# Number of observations
n_linelist <- nrow(linelist)
n_linelist
## [1] 5888

# number missing onset_date and cannot be graphed
missing_onset <- nrow(linelist[is.na(linelist$date_onset),])
missing_onset
## [1] 242

Simple

These are simple epicurves using the incidence package. The epicurve is assigned to the object “epicurve”, which is then plotted. Remember that incidence::plot() is different to base::plot()

The interval defines how the observations are grouped. Options are all those in the package aweek, including but are not limited to:
* “Monday week” * “2 Monday weeks” * “Sunday week”
* “MMWRweek” (starts on Sunday)
* “Month”
* “Quarter”
* “Year”

First date and last date can also be specified.

# incidence object is created, with data aggregated at one day intervals
epicurve_daily   <- incidence::incidence(linelist$date_onset, interval = "day")

# If weekly, you can specific the start day
epicurve_weekly  <- incidence::incidence(linelist$date_onset, interval = "Monday week")
epicurve_3weekly  <- incidence::incidence(linelist$date_onset, interval = "3 weeks")

# Monthly
epicurve_monthly <- incidence::incidence(linelist$date_onset, interval = "month")

# Plot the incidence object
plot(epicurve_daily)
plot(epicurve_weekly)
plot(epicurve_3weekly)
plot(epicurve_monthly)

Aesthetics

Behind the scenes, incidence is using ggplot(), so you can add aesthetic themes and other lines using the ggplot syntax.

# Set theme elements using ggplot syntax
epicurve_theme <- ggplot2::theme(
  axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), 
  legend.title = element_blank(),
  panel.grid.major.x = element_line(color = "grey60", linetype = 3),
  panel.grid.major.y = element_line(color = "grey60", linetype = 3)
)

# Sets labels using ggplot syntax
epicurve_labels <- labs(x = "Week", 
                        y = "Cases (n)", 
                        title = "H5N7 cases by week of onset",
                        caption = paste0("Source: Linelist data from: ", data_date, "; ", missing_onset, " are missing date of onset and not shown."))
                       


# plot the epicurve with aesthetics
nice_plot <- plot(epicurve_weekly, show_cases = TRUE, border = "black", n_breaks = nrow(epicurve_weekly)) + 
  scale_y_continuous(expand = c(0, 0)) +  # set origin for axes
  # add labels
  epicurve_labels +
  # add theme
  epicurve_theme
nice_plot

# Modify nice_plot to show only 6 breaks in the x-axis 
nice_plot + scale_x_incidence(epicurve_weekly, n_breaks = 6)

By descriptor

Now differentiating the cases by gender, using the groups = argument in the incidence command

# Create epiweek object, with counts grouped by gender
epicurve_weekly_gender <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$gender,
                               na_as_group = FALSE)   # Prevents missing values from being assigned their own group


# Plot the epicurve
# Note: Remove the boxes around each case as it makes gender colours hard to see! (show_cases = FALSE)
nice_plot <- plot(epicurve_weekly_gender, show_cases = FALSE, border = "black", n_breaks = nrow(epicurve_weekly_gender)) + 
  
  # add labels (defined in previous section)
  epicurve_labels +
  
  # add theme elements
  epicurve_theme

nice_plot

Filtered

To filter data, This version is filtered to only show data from a specific province.

# filter the dataset and pass it to the incidence() function
Connaught_data <- filter(linelist, hospital == "Connaught Hospital")

epicurve_Connaught <- incidence(Connaught_data$date_onset,
                                interval = "week",
                                groups = Connaught_data$gender)


# Re-sets labels, changing title to reflect subset
epicurve_labels <- labs(x = "Week", 
                        y = "Cases", 
                        title = "Ebola cases by week of onset in Connaught Hospital",
                        caption = paste0("Source: Linelist data from: ", data_date, "; ", missing_onset, " are missing date of onset and not shown."))
                       

# plot as before
plot(epicurve_Connaught, show_cases = TRUE, border = "grey") +
     
  # add labels (defined in previous section)
  epicurve_labels +
  
  # add theme elements
  epicurve_theme

ggplot2

Below are tabs on using “ggplot2” package

Daily and weekly


# Daily case counts 
###################
plot_daily <- ggplot(linelist, aes(x = date_onset)) + 
  
  # stacked bars, bined by day (1 days)
  stat_bin(binwidth = 1, position="stack") 

print(plot_daily)


# Weekly case counts 
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position="stack", fill = "brown") 

print(plot_weekly)

With aesthetics

# Preparation
#############
# Create epiweek variable. Factor argument automatically includes all weeks in span. Numeric shows just the week number.
linelist$epiweek <- aweek::date2week(linelist$date_onset, factor = TRUE, numeric = TRUE)

# Calculate maximum number of cases in an epiweek, to get the maximum y-axis height (also helps with uniformity in multiple plots)
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))

# Weekly case counts 
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position = "stack", fill = "grey", color = "black") +
  
  # X-axis 21-day labels
  scale_x_date( 
    # Sets date label breaks as every 3 weeks from Monday before the first case
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm=T), by = "1 week"),
    
    # axis limits determined by max/min + buffer
    limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), 
    
    # displays as date number, then abbreviated month (e.g. 12 Oct)
    date_labels = "%d-%b",   
    
    # sets origin at (0,0)
    expand = c(0,0)) +                                
  
  # Y-axis breaks every 5 cases
  scale_y_continuous(breaks = seq(0, ymax, 5),
                     limits = c(0, ymax),
                     expand = c(0, 0)) +
  
  # Theme specifications (axis, text, etc.)
  theme(# title
        plot.title = element_text(size=20, hjust= 0, face="bold"),   # title size, font, bold
        # axes
        axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
        axis.text = element_text(size=12),
        axis.title = element_text(size=14, face="bold"),
        axis.line = element_line(colour="black"),
        # background
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        # caption (italics, on right side)
        plot.caption = element_text(hjust = 0, face = "italic")
        ) +
  
  guides(fill = guide_legend(reverse = TRUE,                   # Orders Non-active zones at end of legend
                             override.aes = list(size = 0.2),
                             ncol = 2)) +                        # Number of legend columns
  
  labs(x = "Week of illness onset", 
       y = "Number of cases",
       subtitle = "subtitle here",
       caption = paste0(nrow(linelist),
                        " confirmed and probable cases, reported as of ", data_date, ". ",
                        missing_onset, " cases missing date of onset and not shown.")) +
  
  ggtitle("Epidemic curve")


# print
print(plot_weekly)

By category

Colored by a category

# Setup
########
# Two known classes (select colors from colorbrewer2.org)
colors_overall = c("#d95f02",  # 
                   "#1b9e77",
                   "#7570b3")  #    

# Order sex variable by reverse # of cases, so plot stacks with smallest # of cases at top
linelist$gender <- factor(linelist$gender, 
                         levels = levels(fct_rev(fct_infreq(linelist$gender))))

# Calculates maximum yaxis height for uniformity between the two graphs
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(linelist[is.na(linelist$date_onset),])



# PLOT - BY ONSET DATE
######################
plot_defined_cats <- ggplot(linelist, aes(x = date_onset, fill = gender)) + 
  
  # stacked bars, width of 7 days
  stat_bin(binwidth = 7, position = "stack") +
  
  # Colors and labels of confirmed/probable
  scale_fill_manual(values = rev(colors_overall),
                    labels = str_to_sentence(levels(factor(linelist$gender)))) +
  
  # X-axis scale labels (not aggregation, just the labels)
  scale_x_date(# Sets date label breaks as every week
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
    limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date # then abbreviated month (e.g. 12 Oct)
    expand = c(0, 0)) +                                # sets origin at (0,0)
  
  # Y-scale in breaks, up to the ymax previously defined
  scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
  
  # Themes for axes, titles, background, etc.
  theme(plot.title       = element_text(size=20, hjust=0.5, face="bold"),
        axis.text        = element_text(size=12),
        axis.title       = element_text(size=14, face="bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line        = element_line(colour = "black"),
        axis.text.x      = element_text(angle=90, vjust=0.5, hjust=1)) +
  
  # Legend specifications
  theme(legend.title           = element_blank(),
        legend.justification   = c(0, 1), 
        legend.position        = c(0.09, 0.98),
        legend.background      = element_blank(),
        legend.text            = element_text(size = 12)) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
  
  # Axis and caption labels
  labs(x = "Week of illness onset",
       y = "Number of Cases",
       caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
  
  # Title
  ggtitle("Cases by week of illness onset")

# print
print(plot_defined_cats)

By active area

# PARAMETERS
#############

# Maximum y-value for epiweek (this will be larger than necessary because of missing onset dates)
ymax <- max(table(linelist$epiweek))

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))


# SETUP - ACTIVE/NON-ACTIVE ZONES
#################################
# List of "active" zones with a case in the date range
active_zones <- unique(linelist$province[which(linelist$date_onset > (data_date - 90))])
active_zones

# Table of active zones and their overall number of cases (for ordering their stacked appearance)
order_table <- linelist %>%
  filter(province %in% active_zones) %>%
  group_by(province) %>%
  summarise(cases = n())
order_table

# Create TRUE/FALSE variable for "active" health zones
linelist$active_zone <- ifelse(linelist$province %in% active_zones, TRUE, FALSE)

# Create list of non-active HZ names for bottom of plot
other_zone_names <- unique(sort(linelist$province[linelist$active_zone == FALSE]))

# Make variable for graph categories, including a level for "non-active" zones
linelist$graph_zone <- factor(case_when(
  
  # Value assignments
  # Non-active zones
  linelist$active_zone == FALSE    ~ "Non-active zones",
  # All others are assigned their names, capitalized
  TRUE  ~ stringr::str_to_title(linelist$province)),
  
  # Order of variable levels
  levels = c(
    # "Non-active zones" is first level
    "Non-active zones",  
    
    # Orders active zones by their frequency in linelist, reversed, so most-affected zones are on the BOTTOM of plot
    str_to_title(rev(levels(fct_infreq(as.factor(linelist$province[linelist$active_zone == TRUE])))))))  


table(linelist$graph_zone, useNA = "ifany")


# COLORS
########
# Number of unique values in graph_zone variable, minus 1 (for non-active, which is added later as grey (#cccccc))
colors_needed <- length(unique(linelist$graph_zone, na.rm=T)) - 1 

# List of possible colors (see colorbrewer2.com, qualitative scheme)
colors_linelist = c(#"#cccccc", # first = non-active grey color
                "#1b9e77", # turquoise green
                "#ff7f00", # orange
                "#ffff33", # yellow
                "#6a3d9a", # purple
                "#b15928", # brown
                "#1f78b4", # blue
                "#e31a1c", # red,
                "#fb9a99", # pink
                "#b2df8a", # light green 
                "#cab2d6", # light purple
                "#a6cee3", # light blue
                "#fdbf6f", # beige
                "#33a02c"  # green
)

# Reduce number of colors to only the number needed
colors_linelist <- c("#cccccc", rev(colors_linelist[1:colors_needed]))


# MAKE GRAPH
#############
plot_overall <- ggplot(linelist, aes(x = date_onset, fill = graph_zone)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position = "stack") +
  
  # Fill of bars
  scale_fill_manual(values = colors_linelist, 
                    labels = str_to_sentence(levels(factor(linelist$graph_zone)))) +
  
  # X-axis 21-day labels
  scale_x_date( # Sets date label breaks as every 3 weeks from Monday before the first case
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
    limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date number, then abbreviated month (e.g. 12 Oct)
    expand = c(0,0)) +                                # sets origin at (0,0)
  
  # Y-axis breaks every 5 cases
  scale_y_continuous(breaks = seq(0, ymax, 5),
                     limits = c(0, ymax),
                     expand = c(0, 0)) +
  
  # Theme specifications (axis, text, etc.)
  theme(plot.title = element_text(size = 20, hjust = 0, face = "bold"),   # title size, font, bold
        axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
        axis.text = element_text(size=12),
        axis.title = element_text(size=14, face="bold"),
        axis.line = element_line(colour="black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.caption = element_text(hjust = 0, face = "italic")
        ) +
  
  # Legend specifications
  theme(legend.title = element_blank(),                        # No legend title
        legend.position = c(0.20, 0.85),                         # placement of legend
        legend.background = element_blank(),                     # legend background   
        legend.text = element_text(size=12)) +                 # legend text size
  
  guides(fill = guide_legend(reverse = TRUE,                   # Orders Non-active zones at end of legend
                             override.aes = list(size = 0.2),
                             ncol = 2)) +                        # Number of legend columns
  
  labs(x = "Week of illness onset", 
       y = "Number of cases",
       subtitle = "Health zones with cases in the last 42 days specified by color",
       caption = paste0(nrow(linelist),
                        " confirmed and probable cases, reported as of ", data_date, ". ",
                        missing_onset, " cases missing date of onset and not shown.",
                        "\nNon-active zones include: ", str_to_title(toString(unique(linelist$province[linelist$active_zone == FALSE]))))) +
  
  ggtitle("Epidemic curve by active health zones")


# print
plot_overall

By grouped areas


#SETUP
#############
# Filter to health zone of interest
zone_data <- linelist

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))

# Assign health area groups (individual for HAs of interest, groups others together)
linelist$graph_areas <- factor(case_when(
  linelist$province == "Shanghai"    ~ "Shanghai",
  linelist$province == "Jiangsu"     ~ "Jiangsu",
  linelist$province == "Zhejiang"    ~ "Zhejiang",
  TRUE                                ~ "Other (10)"
),
# Levels part of the factor function assigns order of appearance
levels = c(                         
  "Other (10)",
  "Shanghai",
  "Jiangsu",
  "Zhejiang"
)
)

# checks
table(linelist$graph_areas, useNA = "ifany")

# Color assignments
colors_needed <- length(unique(linelist$graph_areas, na.rm=T)) - 1 # number of colors needed

# list of colors
colors_aire = c("#a6cee3", 
                           "#1f78b4",
                           "#b2df8a", 
                           "#33a02c",
                           "#fb9a99",
                           "#e31a1c",
                           "#fdbf6f",
                           "#ff7f00",
                           "#cab2d6",
                           "#6a3d9a",
                           "#ffff99",
                           "#b15928"
                           )

# Reduce number of colors to only the number needed
colors_aire <- c("#cccccc", rev(colors_aire[1:colors_needed]))


# Plot of province
#####################################
plot <- ggplot(linelist, aes(x = date_onset, fill = graph_areas)) + 
  
  stat_bin(binwidth = 7, position="stack") +
  
  scale_fill_manual(values = colors_aire, labels = str_to_sentence(levels(factor(linelist$graph_areas)))) +
  
  scale_x_date(date_breaks = "1 week", date_labels = "%d-%b", limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), expand=c(0,0)) + # I used the date onset variable here so x axes will be the same
  
  scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
  
  theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
        plot.caption = element_text(hjust = 0, face = "italic"),
        
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        
        
        axis.line = element_line(colour = "black"),
        axis.text = element_text(size = 12),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        axis.title = element_text(size = 14, face = "bold"),
        
        legend.title = element_blank(),
        legend.justification = c(0,1), 
        legend.position = c(0.05, 1),
        legend.background = element_blank(),
        legend.text = element_text(size = 12)) +
  
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 4)) +
  
  labs(x="Week of illness onset", 
       y="Number of cases",
       subtitle = "",
       caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset.")) +
  
  ggtitle("Cases of influenza, by province")

plot

By “wave” in time

# Define the waves
##################
# zone_data <- filter(linelist, zone_de_sante == "mabalako")
# 
# zone_data$wave <- case_when(
#   zone_data$date_onset >= as.Date("2018-03-01") &
#     zone_data$date_onset < as.Date("2018-10-25")     ~ "Wave 1",
#   
#   zone_data$date_onset >= as.Date("2018-10-25") &
#     zone_data$date_onset < as.Date("2019-02-01")     ~ "Wave 2",
#   
#   zone_data$date_onset >= as.Date("2019-02-01") &
#     zone_data$date_onset < as.Date("2019-09-15")     ~ "Wave 3",
#   
#   zone_data$date_onset >= as.Date("2019-09-15")      ~ "Wave 4",
#   
#   TRUE ~ NA_character_
# )
# 
# table(is.na(zone_data$date_onset))
# table(zone_data$wave, useNA = "always")
# 
# 
# # Color assignments
# colors_needed <- length(unique(zone_data$wave, na.rm=T)) # number of colors needed
# 
# # list of colors
# colors_aire = c("#a6cee3", 
#                            "#1f78b4",
#                            "#b2df8a", 
#                            "#33a02c",
#                            "#fb9a99",
#                            "#e31a1c",
#                            "#fdbf6f",
#                            "#ff7f00",
#                            "#cab2d6",
#                            "#6a3d9a",
#                            "#ffff99",
#                            "#b15928"
#                            )
# 
# # Reduce number of colors to only the number needed
# colors_aire <- c(rev(colors_aire[1:colors_needed]))
# 
# 
# # Plot of health zone colored by wave
# #####################################
# plot_Mabalako <- ggplot(zone_data, aes(x = date_onset, fill = wave)) + 
#   
#   stat_bin(binwidth = 7, position = "stack") +
#   
#   scale_fill_manual(values = rev(colors_aire), labels = str_to_sentence(levels(factor(zone_data$wave)))) +
#   
#   scale_x_date(date_breaks = "21 days", date_labels = "%d-%b",
#                limits = c((min(zone_data$date_onset, na.rm = T) - 8), (max(zone_data$date_report, na.rm = T) + 8)), expand = c(0,0)) +
#   
#   scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
#   
#   theme(text = element_text(family = "Segoe Condensed"),
#         axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
#         axis.text = element_text(size = 12),
#         axis.title = element_text(size = 14, face = "bold"),
#         axis.line = element_line(colour = "black"),
#         
#         plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
#         plot.caption = element_text(hjust = 0, face = "italic"),
#         
#         panel.grid.major = element_blank(),
#         panel.grid.minor = element_blank(),
#         panel.background = element_blank(),
#         
#         legend.title = element_blank(),
#         legend.justification = c(0,1), 
#         legend.position = c(0.75, 0.98),
#         legend.background = element_blank(),
#         legend.text = element_text(size=12)) +
#   
#   guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 1)) +
#   
#   labs(x="Week of illness onset", 
#        y="Number of cases",
#        subtitle = "",
#        caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset and 16 excluded due to uncertain health zone of report.")) +
#   
#   ggtitle("Four waves of EVD in Mabalako health zone")
# 
# plot_Mabalako
# 
# 
# # Produce table describing each wave
# ####################################
# table <- zone_data %>%
#   select("aire_de_sante", "wave", "community_death", "date_onset", "cte_date", "epicasedef", "community_death", "contact_registered", "contact_surveilled") %>%
#   group_by(wave) %>%
#   summarise(first_onset       = min(date_onset, na.rm = T),
#             last_admission    = max(cte_date, na.rm = T),
#             n                 = n(),
#             confirmed         = sum(epicasedef == "confirmed"),
#             community_deaths  = paste0(sum(community_death    == 1), 
#                                        " (", round(100*sum(community_death == 1)/confirmed),"%)"),
#             reg_contacts      = paste0(sum(contact_registered == "yes"),
#                                        " (", round(100*sum(contact_registered == "yes")/confirmed),"%)"),
#             surv_contacts     = paste0(sum(contact_surveilled == "yes"),
#                                        " (", round(100*sum(contact_surveilled == "yes")/confirmed),"%)"),
#             top               = paste(toupper(names(sort(table(aire_de_sante),decreasing=TRUE)[1:3])), collapse=", ",
#                                       round(100*(sort(table(aire_de_sante),decreasing=TRUE)[1:3]/confirmed)), "%"),
#             health_areas      = paste(toupper(unique(aire_de_sante)), collapse=', ') 
#             )
# 
# kable(table)

Aggregated data

Situation

Often you do not have linelist data, but instead daily case counts from facilities, districts, etc. You can plot these in an epidemiological curve, but the code will be slightly different.

This section will utilize the counts_data dataset that was imported earlier, in the data preparation section.

Note: The incidence package does not support aggregate data

Clean dates

As before, we must ensure date variables are correctly classified.

# Convert Date variable to Date class
class(count_data$date_hospitalisation)
## [1] "POSIXct" "POSIXt"
count_data$date_hospitalisation <- as.Date(count_data$date_hospitalisation)

Create weeks

# Create epiweek variable
# aweek weeks are also stored as dates, facilitating better display manipulation
count_data$epiweek <- aweek::date2week(count_data$date_hospitalisation,      # use the Date variable
                                        week_start = "Monday", # epiweek begins on Monday
                                        floor_day = TRUE,      # only display year and week #
                                        factor = TRUE)         # expand to include all possible weeks

Clean dates

ggplot(data = count_data, aes(x = as.Date(epiweek), y = n_cases, group = hospital, fill = hospital))+
     geom_bar(stat = "identity")+
     
     # LABELS for x-axis
     scale_x_date(date_breaks = "1 month",  # displays by month
                  date_labels = '%b%d\n%Y')+  #labeled by month with year below
     
     # Choose color palette (uses RColorBrewer package)
     scale_fill_brewer(palette = "Pastel1")+      
     
     # Theme specifications (axis, text, etc.)
     theme(
          # title
          plot.title = element_text(size=20, hjust= 0, face="bold"),   # title size, font, bold
          # axes
          axis.text.x = element_text(angle=0, vjust=0.5, hjust=1),
          axis.text = element_text(size=12),
          axis.title = element_text(size=14, face="bold"),
          axis.line = element_line(colour="black"),
          # background
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          # caption (italics, on right side)
          plot.caption = element_text(hjust = 0, face = "italic"))+
     
     # labels
     labs(x = "Week of report", 
          y = "Number of cases",
          subtitle = "Cases aggregated by week and shown by hospital",
          caption = "Data source: XXXXX")+
     
     ggtitle("Epidemic curve of disease X in fictional location")

Dual-axis

Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.

In ggplot it is very difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.

Second axis

This uses the linelist dataset

TODO not complete yet

library(reshape2)
# group the data by week, summarize counts by group (gender)
linelist_week <- linelist %>%
     mutate(onset_epiweek = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
     group_by(onset_epiweek) %>%
     summarize(num_male = sum(gender == "m"),
               num_female = sum(gender == "f"),
               pct_male = round(100*(num_male / n())),
               med_age  = median(as.numeric(age), na.rm=T)
               ) 
# remove pct and melt
linelist_week_melted <- linelist_week %>%
     select(-c("pct_male", "med_age")) %>%
     melt(id.vars = c("onset_epiweek"))

# merge together (multiple of the same values in week will attach to melted)
linelist_week_melted <- merge(linelist_week_melted,
                              linelist_week,
                              by = "onset_epiweek")

second_axis <- ggplot(linelist_week_melted,
                      aes(x = as.Date(onset_epiweek),
                          y = value, group = variable,
                          fill = variable)) + 
  
  # bars
  geom_bar(stat = "identity")+
  
  # Colors and labels of confirmed/probable
  scale_fill_manual(values = c("blue", "red"),
                    labels = str_to_sentence(levels(factor(linelist_week_melted$variable)))) +
  
  geom_line(mapping = aes(y = pct_male, color = "% male"), size = 0.5) +
     
  scale_color_manual(values = "black")+
     
  scale_y_continuous(sec.axis = sec_axis(~(./sum(linelist_week_melted$value, na.rm = T)*100), name = "name here", breaks = seq(0, 100, 20)))+


  # X-axis scale labels (not aggregation, just the labels)
  scale_x_date(# Sets date label breaks as every week
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
    limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date # then abbreviated month (e.g. 12 Oct)
    expand = c(0, 0)) +                                # sets origin at (0,0)
  
  # Y-scale in breaks, up to the ymax previously defined
  scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
  
  # Themes for axes, titles, background, etc.
  theme(plot.title       = element_text(size=20, hjust=0.5, face="bold"),
        axis.text        = element_text(size=12),
        axis.title       = element_text(size=14, face="bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line        = element_line(colour = "black"),
        axis.text.x      = element_text(angle=90, vjust=0.5, hjust=1)) +
  
  # Legend specifications
  theme(legend.title           = element_blank(),
        legend.justification   = c(0, 1), 
        legend.position        = c(0.09, 0.98),
        legend.background      = element_blank(),
        legend.text            = element_text(size = 12)) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
  
  # Axis and caption labels
  labs(x = "Week of illness onset",
       y = "Number of Cases",
       caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
  
  # Title
  ggtitle("Cases by week of illness onset")

second_axis

# print
print(plot_defined_cats)

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.